Recover PGlite data dirs with corrupt WAL #994
Chiming in with an independent reproduction and a deployment data point on this branch.

Setup

PGlite engine inside gbrain (a Postgres-native personal knowledge brain), used as the storage for a long-running HTTP MCP server under launchd on macOS 26.x (Darwin 25.5.0) arm64 / Bun 1.3.11. ~520-page corpus, 80 MB on disk. Single-writer model: the daemon holds the exclusive lock for hours; CLI tools occasionally grab it briefly for inserts.

Bug profile, without #994

Five

Pre-session organic rate: ~3 wedges in 5 days under normal operation, matching the launchd-managed MCP's restart cadence. During a 3-hour debug session with ~6 manual

Failure signature is the one this PR's body describes:

Converged on the same diagnosis: launchd SIGTERM → PGlite shutdown-checkpoint race leaves WAL torn at shutdown, replay can't reconcile on next open, Emscripten

Branch deployment

Built head

The recovery path itself hasn't fired in anger yet: no wedge has occurred in the few hours since the overlay landed. Plan to follow up with an in-anger observation when the next wedge happens, specifically whether

What I can offer back

Thanks for the work; the recovery scope (PG17-only layout, opt-out via
This adds a recovery path for NodeFS data directories that fail during startup because Postgres reports corrupt WAL or checkpoint state. Instead of surfacing only an Emscripten abort, PGlite now captures the startup logs, resets WAL in the existing data directory, and retries startup once.
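The control flow described above (capture startup logs, reset WAL only when the logs point at WAL or checkpoint corruption, retry exactly once) might be sketched roughly as follows. This is not the PR's code: `startWithRepair`, `RepairMode`, the `"auto"` mode name, and all three callback parameters are hypothetical names chosen for illustration. Only the `'none'` opt-out value and the `repairedDataDir` flag come from the description.

```typescript
// Hypothetical sketch of a "retry startup once after WAL reset" wrapper.
// Only 'none' and repairedDataDir are taken from the PR text; every other
// name here is an assumption made for illustration.
type RepairMode = "auto" | "none"; // "auto" is a placeholder name

interface StartResult<T> {
  instance: T;
  repairedDataDir: boolean;
}

async function startWithRepair<T>(
  // Starts the engine, forwarding each startup log line to `log`.
  start: (log: (line: string) => void) => Promise<T>,
  // Resets WAL in place inside the existing data directory.
  resetWal: () => Promise<void>,
  // Decides from the captured logs whether the failure is WAL/checkpoint corruption.
  looksLikeCorruptWal: (logs: string[]) => boolean,
  mode: RepairMode = "auto",
): Promise<StartResult<T>> {
  const logs: string[] = [];
  const capture = (line: string): void => {
    logs.push(line);
  };
  try {
    return { instance: await start(capture), repairedDataDir: false };
  } catch (err) {
    // Opted out, or a failure that is not about WAL: surface the original error.
    if (mode === "none" || !looksLikeCorruptWal(logs)) throw err;
    await resetWal();
    // Single retry. If this throws too, the error propagates to the caller.
    return { instance: await start(capture), repairedDataDir: true };
  }
}
```

Keeping the corruption check as a predicate over captured logs means an unrelated startup failure is never "repaired", which matches the narrow scope the description aims for.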
The reset code is intentionally narrow. It only runs for PGlite's PostgreSQL 17 layout, validates pg_control and WAL sizes, removes stale postmaster.pid and old WAL files, writes a replacement shutdown checkpoint record, and preserves the data files. Users can opt out with dataDirRepair: 'none'. The instance exposes repairedDataDir when recovery happened.
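To make the pre-reset checks more concrete, here is a minimal read-only inspection sketch. It is an illustration under stated assumptions, not the PR's validation logic: `inspectDataDir` and `DataDirReport` are hypothetical names, and the size constants are stock PostgreSQL defaults (8192-byte `pg_control`, 16 MiB WAL segments) that PGlite's build may or may not share. The file locations (`PG_VERSION`, `global/pg_control`, `pg_wal/`, `postmaster.pid`) are the standard PostgreSQL data directory layout.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, readFileSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Stock PostgreSQL defaults; assumed (not confirmed) to apply to PGlite.
const PG_CONTROL_SIZE = 8192;
const DEFAULT_WAL_SEGMENT_SIZE = 16 * 1024 * 1024;
// WAL segment file names are 24 uppercase hex digits.
const WAL_SEGMENT_NAME = /^[0-9A-F]{24}$/;

interface DataDirReport {
  pgVersion: string | null;      // contents of PG_VERSION, e.g. "17"
  pgControlSizeOk: boolean;      // global/pg_control exists with the expected size
  pidFilePresent: boolean;       // postmaster.pid exists (possibly stale)
  oddSizedWalSegments: string[]; // segments whose size differs from the expected one
}

// Read-only inspection: reports findings, changes nothing on disk.
function inspectDataDir(dataDir: string, walSegmentSize = DEFAULT_WAL_SEGMENT_SIZE): DataDirReport {
  const versionFile = join(dataDir, "PG_VERSION");
  const controlFile = join(dataDir, "global", "pg_control");
  const walDir = join(dataDir, "pg_wal");

  const walNames = existsSync(walDir)
    ? readdirSync(walDir).filter((name) => WAL_SEGMENT_NAME.test(name)).sort()
    : [];

  return {
    pgVersion: existsSync(versionFile) ? readFileSync(versionFile, "utf8").trim() : null,
    pgControlSizeOk: existsSync(controlFile) && statSync(controlFile).size === PG_CONTROL_SIZE,
    pidFilePresent: existsSync(join(dataDir, "postmaster.pid")),
    oddSizedWalSegments: walNames.filter((name) => statSync(join(walDir, name)).size !== walSegmentSize),
  };
}

// Usage against a scratch directory that only mimics the layout (not a real cluster).
const scratch = mkdtempSync(join(tmpdir(), "pgdata-sketch-"));
mkdirSync(join(scratch, "global"));
mkdirSync(join(scratch, "pg_wal"));
writeFileSync(join(scratch, "PG_VERSION"), "17\n");
writeFileSync(join(scratch, "global", "pg_control"), new Uint8Array(PG_CONTROL_SIZE));
writeFileSync(join(scratch, "pg_wal", "000000010000000000000001"), new Uint8Array(1024)); // short segment
writeFileSync(join(scratch, "postmaster.pid"), "12345\n");

const report = inspectDataDir(scratch);
console.log(report);
```

A repair routine would presumably refuse to proceed unless `pgVersion` is `"17"` and `pgControlSizeOk` is true, but which findings the PR treats as blocking is not stated here.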
The regression test creates a database, inserts a row, truncates WAL, reopens the same data directory, and verifies the original row is still readable after recovery. It also checks that the opt-out path still fails with a startup error.
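For anyone wanting to reproduce the corruption step by hand, one way to simulate a torn WAL is to truncate the newest segment file while the database is closed. The helper below is a guess at that step, not the PR's test code: `truncateNewestWalSegment` is a hypothetical name, and how the actual test truncates WAL is not shown in this description. It relies only on the standard `pg_wal/` layout and 24-hex-digit segment names.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, statSync, truncateSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// WAL segment file names are 24 uppercase hex digits.
const SEGMENT_NAME = /^[0-9A-F]{24}$/;

// Truncates the lexicographically last WAL segment to `keepBytes` and returns its path.
// Run this only against a closed, disposable data directory.
function truncateNewestWalSegment(dataDir: string, keepBytes: number): string {
  const walDir = join(dataDir, "pg_wal");
  const segments = readdirSync(walDir).filter((name) => SEGMENT_NAME.test(name)).sort();
  const newest = segments.pop();
  if (newest === undefined) throw new Error(`no WAL segments found in ${walDir}`);
  const target = join(walDir, newest);
  truncateSync(target, keepBytes);
  return target;
}

// Usage against a scratch directory with two fake 4 KiB segments.
const fakeDataDir = mkdtempSync(join(tmpdir(), "wal-truncate-sketch-"));
mkdirSync(join(fakeDataDir, "pg_wal"));
writeFileSync(join(fakeDataDir, "pg_wal", "000000010000000000000001"), new Uint8Array(4096));
writeFileSync(join(fakeDataDir, "pg_wal", "000000010000000000000002"), new Uint8Array(4096));

const truncatedPath = truncateNewestWalSegment(fakeDataDir, 100);
console.log(truncatedPath, statSync(truncatedPath).size);
```

After a step like this, reopening the directory should either recover (with the flag set) or fail with a startup error when repair is disabled, which are the two outcomes the regression test asserts.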
Validation run: